Abstract


In this document we compare the performance of Amazon Web Services
(AWS), also known as the Amazon Cloud, with that of the clarreo cluster
and assess its suitability for the computational needs of the CLARREO
mission. A benchmark executable that processes one month and one year
of PARASOL data was used. With the optimal AWS configuration, adequate
data-processing times, comparable to those of the clarreo cluster, were
found. The assessment of alternatives to the clarreo cluster continues,
and several options, such as a NASA-based cluster, are being considered.
1 Introduction


This study was motivated by the need to explore alternatives to the
clarreo cluster, whose maintenance contract is expiring. Amazon Web
Services (AWS) was chosen for this study because of the potential
savings on IT support and maintenance costs, as well as the easy
configurability of CPUs, memory and storage. An additional advantage
of AWS is its accessibility to offsite collaborators. In what follows,
we benchmark the performance of the virtual AWS-based cluster against
that of the similarly configured clarreo cluster based at NASA/LaRC.
In both cases we used the 2006 dataset collected by the PARASOL
microsatellite, which flew as part of the A-Train formation and was
active between 2004 and 2013. The executable merged and filtered the
data, producing ROOT ntuples as output. One month (about 250 GB) and
one year (about 3 TB) of 2006 PARASOL data were processed and compared.
For both clusters roughly 90% of the executable’s running time was
found to be spent on I/O operations, while the rest was spent on CPU
processing. On AWS the performance with two types of storage, the NFS
shared filesystem and S3 storage, was assessed. On the clarreo cluster
the GPFS filesystem and the local disk storage were used to read and
write the data.


2 AWS Cluster Setup and clarreo Cluster Configuration


The virtual AWS cluster [1] was set up as seven C3.8xlarge [2]
compute-optimized instances of the EC2 cloud [3]. The setup relied on
the open-source toolkit from MIT called StarCluster [4], which is
designed to automate and simplify the process of building, configuring,
and managing (i.e., starting and stopping the individual nodes)
clusters of virtual machines. StarCluster features support for
EBS-backed clusters, OpenMPI and languages such as R, Python, C and
C++. Both the fully redundant AWS Simple Storage Service (S3) [5] and
the NFS [6] filesystem were set up, and their performance was compared
in this study. For parallel job submission the Sun Grid Engine (SGE)
[7] with 168 slots (168 CPUs) was installed. The number of slots was
set to be roughly equivalent to that on the clarreo cluster.

The NASA/LaRC-based clarreo cluster consists of 15 IBM iDataPlex [8]
compute nodes with 12 CPUs each. The cluster nodes are GPFS-mounted
[9]. For the tests discussed in this document the number of available
slots varied between 150 and 154.
3 Processing Steps and Data Description


The PARASOL (Polarization and Anisotropy of Reflectances for
Atmospheric Sciences coupled with Observations from a Lidar) mission
[10] was active between 2004 and 2013 and consisted of a microsatellite
flying as part of the A-Train formation at 705 km altitude. The
mission’s primary aim was to study aerosols and clouds. Measured data
from a single PARASOL orbit are distributed among several data files.
For this study three data streams, L1-B [11], RB2 and OC2 [12], were
used as input. The three data files corresponding to one orbit are read
by the executable, which merges and filters the data and writes the
results to a single output file. In a second processing step, which was
not performed in this study, the filtered and merged output data can be
used to produce the desired plots.

L1-B (Level-1B) PARASOL input data files contain, among other
measurements, geometric parameters, such as solar and viewing zenith
angles and relative azimuth, and the Stokes vector components
describing the polarization measured by PARASOL. The remaining two
types of input files contain Level-2 data corresponding to the
Radiation Budget (RB2) and Ocean Color (OC2) streams with cloud and
aerosol parameters. The input data files are in binary format, while
the output data are in the form of ROOT [13] ntuples [14].

To test the performance and scalability of AWS we used one month
(January) and the entire year of 2006 PARASOL data. The sizes and the
numbers of files in each of the three input streams are shown in
Table 1. For January 2006 there were a total of 394 files,
corresponding to 394 PARASOL orbits, for each data stream. The total
disk usage for the three streams was 224 GB. For one year the number of
files was 4642 per stream, with the three streams occupying 2.8 TB of
disk space. The number of output files is the same as the number of
input files in one stream (Table 2). The combined disk space needed to
store the output was 57 GB and 660 GB for one month and one year,
respectively.


Data stream    Single file   Num. of files   Disk space (GB)   Num. of files   Disk space (GB)
               (MB)          (one mo.)       (one mo.)         (one year)      (one year)

L1-B binary    560           394             212               4642            2700
RB2 binary     20            394             12                4642            100
OC2 binary     0.1-0.6       394             0.127             4642            1.5
Total input    580-581       1182            224               13926           2800


Table 1. Disk space and the number of files occupied by the L1-B, RB2
and OC2 input data streams for one month (January 2006) and one year
(2006) of PARASOL data.

Data stream    Single file   Num. of files   Disk space (GB)   Num. of files   Disk space (GB)
               (MB)          (one mo.)       (one mo.)         (one year)      (one year)

ROOT ntuple    50-250        394             57                4642            660


Table 2. Output file sizes corresponding to the input data in Table 1.
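
As a quick consistency check of Tables 1 and 2, the short Python sketch
below sums the per-stream values and computes the output-to-input size
ratio. The dictionary layout and variable names are illustrative only;
the numbers are copied from the two tables.

```python
# Per-stream input values copied from Table 1:
# (files per month, GB per month, files per year, GB per year)
input_streams = {
    "L1-B": (394, 212.0, 4642, 2700.0),
    "RB2": (394, 12.0, 4642, 100.0),
    "OC2": (394, 0.127, 4642, 1.5),
}
# Output values copied from Table 2 (ROOT ntuples), in GB.
output_gb = {"month": 57.0, "year": 660.0}

files_month = sum(s[0] for s in input_streams.values())
gb_month = sum(s[1] for s in input_streams.values())
files_year = sum(s[2] for s in input_streams.values())
gb_year = sum(s[3] for s in input_streams.values())

# Fraction of the input volume that survives merging and filtering.
ratio_month = output_gb["month"] / gb_month
ratio_year = output_gb["year"] / gb_year

print(f"one month: {files_month} files, {gb_month:.1f} GB in, "
      f"output/input = {ratio_month:.2f}")
print(f"one year:  {files_year} files, {gb_year:.1f} GB in, "
      f"output/input = {ratio_year:.2f}")
```

The sums reproduce the totals row of Table 1 after rounding, and in
both cases the output occupies roughly a quarter of the input volume.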

4 Performance Comparison of the Amazon Cloud 
vs the clarreo Cluster Using PARASOL Data 

In this section we describe the tests performed on the Amazon Cloud
using two types of storage, S3 and NFS. These results are then compared
against two tests performed on the clarreo cluster: one using the GPFS
filesystem and the other using local disks.

4.1 AWS S3 Test 

For this test the S3 filesystem was used to store the input data. The
L1-B, RB2 and OC2 data files were copied to the local SSD volume, where
they were processed by the executable. Once the output ROOT ntuple was
created, it was copied back to the S3 filesystem. At the end of the
execution all the input files and the ROOT output files were deleted
from the SSD.
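
The per-orbit staging pattern described above (copy the inputs to local
scratch, process, copy the output back, clean up) can be sketched as
follows. This is a minimal illustration, not the scripts used in the
study: the function process_orbit, the callables fetch, store and
run_executable, and all file names are hypothetical, and plain local
directories stand in for S3 and the SSD volume.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Sequence


def process_orbit(
    input_names: Sequence[str],
    output_name: str,
    fetch: Callable[[str, Path], None],
    store: Callable[[Path, str], None],
    run_executable: Callable[[Sequence[Path], Path], None],
    scratch_root: Path,
) -> None:
    """Stage inputs to scratch, process them, store the output, clean up."""
    scratch = Path(tempfile.mkdtemp(dir=scratch_root))
    try:
        local_inputs = []
        for name in input_names:
            local_path = scratch / name
            fetch(name, local_path)        # remote storage -> scratch
            local_inputs.append(local_path)
        local_output = scratch / output_name
        run_executable(local_inputs, local_output)
        store(local_output, output_name)   # scratch -> remote storage
    finally:
        # Remove the staged inputs and the local output copy in all cases.
        shutil.rmtree(scratch, ignore_errors=True)


def demo() -> str:
    """Run the pattern once with directories standing in for S3 and SSD."""
    with tempfile.TemporaryDirectory() as tmp:
        remote = Path(tmp) / "remote"
        ssd = Path(tmp) / "ssd"
        remote.mkdir()
        ssd.mkdir()
        # Hypothetical file names for the three streams of one orbit.
        names = ["orbit001.l1b", "orbit001.rb2", "orbit001.oc2"]
        for n in names:
            (remote / n).write_text(n + "\n")

        def fetch(name: str, dest: Path) -> None:
            shutil.copy(remote / name, dest)

        def store(src: Path, name: str) -> None:
            shutil.copy(src, remote / name)

        def merge(inputs: Sequence[Path], out: Path) -> None:
            # Stand-in for the merge-and-filter executable.
            out.write_text("".join(p.read_text() for p in inputs))

        process_orbit(names, "orbit001.out", fetch, store, merge, ssd)
        assert list(ssd.iterdir()) == [], "scratch should be empty afterwards"
        return (remote / "orbit001.out").read_text()


if __name__ == "__main__":
    print(demo())
```

In a real deployment fetch and store would wrap whatever S3 client is
in use. The finally clause mirrors the deletion of the input and output
files from the SSD at the end of each job.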

Sun Grid Engine (SGE) was used to process the January 2006 and the 
entire 2006 PARASOL datasets in batch. The total number of submitted 
jobs on the cluster was equal to the number of orbits/files in a stream. 
The execution was timed from the beginning of submission until the last
file was processed. The average run time to process one month of data
was found to be 7 minutes. For the entire 2006 dataset the running 
time was found to be 57 minutes. 
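
Because one job is submitted per orbit, the jobs run in successive
"waves" whenever they outnumber the available slots. The sketch below
is a back-of-envelope estimate under the simplifying assumption (ours,
not a measurement) that all jobs take about the same time and that all
168 slots stay busy; the function name waves is illustrative.

```python
import math


def waves(num_jobs: int, num_slots: int) -> int:
    """Number of scheduling rounds needed if every job takes equally long."""
    return math.ceil(num_jobs / num_slots)


SLOTS = 168  # SGE slots on the AWS cluster (Section 2)
runs = {
    "January 2006": (394, 7.0),   # (jobs, measured total minutes)
    "full 2006": (4642, 57.0),
}

for label, (jobs, minutes) in runs.items():
    n = waves(jobs, SLOTS)
    print(f"{label}: {jobs} jobs -> {n} waves, "
          f"about {minutes / n:.1f} min per wave")
```

Under this assumption both runs imply roughly two minutes per wave,
which suggests that the full-year run time scales about as expected
from the one-month run.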

We also assessed the cluster’s processing speed (see the first row of
the last two columns in Table 3). To quantify it we used one month of
2006 data, running the test twice with January data and once with June
data. Within the C++ executable we used ctime library calls to time the
transfer of the three input files from S3 to the local SSD volume and
of the output ROOT ntuple from the SSD volume to S3. The execution time
is then the total running time of the executable minus the time to
transfer the input and output files (footnote 1). The processing time
was computed for each job, and the percentages quoted in the last two
columns of Table 3 are averages over the submitted jobs.
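
The bookkeeping described above can be illustrated with a short Python
sketch. The study's executable is written in C++ and uses ctime; the
version below is only an analogous illustration, and the JobTiming
record, the function average_percentages, and the sample timings are
hypothetical values chosen for the example, not measured data.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class JobTiming:
    total_s: float         # wall-clock time of the whole job
    transfer_in_s: float   # time to copy the three input files
    transfer_out_s: float  # time to copy the output ntuple

    @property
    def execution_s(self) -> float:
        # Execution time = total time minus input and output transfer time.
        return self.total_s - self.transfer_in_s - self.transfer_out_s


def average_percentages(jobs: Iterable[JobTiming]) -> Tuple[float, float]:
    """Return (execution %, transfer %) averaged over the jobs."""
    jobs = list(jobs)
    if not jobs:
        raise ValueError("at least one job timing is required")
    exec_pct = [100.0 * j.execution_s / j.total_s for j in jobs]
    mean_exec = sum(exec_pct) / len(exec_pct)
    return mean_exec, 100.0 - mean_exec


# Made-up timings for three jobs, for illustration only.
sample = [
    JobTiming(total_s=100.0, transfer_in_s=40.0, transfer_out_s=10.0),
    JobTiming(total_s=120.0, transfer_in_s=42.0, transfer_out_s=12.0),
    JobTiming(total_s=80.0, transfer_in_s=30.0, transfer_out_s=6.0),
]
execution_pct, transfer_pct = average_percentages(sample)
print(f"execution {execution_pct:.1f}%, transfer {transfer_pct:.1f}%")
```

The sketch averages the per-job percentages, following the wording
above, rather than dividing summed times; the two choices can differ
when job lengths vary.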

4.2 AWS NFS Tests 

For the AWS NFS test the seven cloud nodes were each configured with 32
SGE slots, for a total of 224 SGE slots. Two Amazon volume types were
used: the Provisioned IOPS (io1) and the General Purpose (gp2) volumes.
Two tests, one for each volume type, were run. The total time to
process the January 2006 dataset was 27 min on the Provisioned IOPS
volume (Table 3). In the case of gp2 the run time was deemed too long
and the run was terminated after 50 min. Compared to AWS S3 and the
clarreo cluster, the performance on the NFS filesystem is considered
inadequate. Because of the slow one-month processing, the entire-year
run was not attempted.

Footnote 1: The execution time is thus composed of the pure CPU
processing time, plus the time spent reading the input and writing the
output files (I/O).

4.3 clarreo GPFS Test 

In this test we processed the January 2006 dataset and the full-year
2006 dataset on the clarreo cluster. The input, the output and the
executable were located on three separate devices linked through GPFS
[9]. Sun Grid Engine with approximately 150 slots (footnote 2) was used
to process the January 2006 data. The submission scripts and the
executable were analogous to those of the AWS NFS test. For timing
consistency the same January 2006 dataset was processed three times.
The average run time to process the January 2006 dataset was found to
be 5 minutes. The time to process the entire 2006 dataset was 45
minutes.

4.4 clarreo Local Disk Read/Write Test 

The input data in this case were copied from the GPFS shared filesystem
onto each node’s local disk for processing. The output was written to
the same local disk. As in the GPFS test above, approximately 150 slots
were used to process the January 2006 data. Within the precision of our
timing, the average run time to process the January 2006 dataset was 6
minutes, about one minute longer than in the GPFS test. The time to
process the entire 2006 dataset was found to be 49 minutes. As in the
AWS S3 test, we calculated the processing and data transfer rates to
and from the local disk. The results are shown in the last row of the
last two columns of Table 3.

4.5 The Amazon Cloud vs. the clarreo Cluster Performance Summary

In Table 3 we summarize the performance of the four test configurations
described above. Of the two tests performed on the Amazon Cloud, only
the S3 test yielded acceptable results. Although transferring data in
and out of the S3 filesystem slowed the throughput, the overall run
time of 57 minutes on the cloud is comparable to that of the clarreo
GPFS test, which ran for 45 minutes. We note (see the last two columns)
that roughly half of the total running time on AWS was spent
transferring the data from S3 onto a local SSD volume. This is
noticeably slower than our benchmark test of copying the GPFS-mounted
data to a local disk on the clarreo cluster, which took about one
quarter of the total execution time.

Footnote 2: Other users were executing jobs on the clarreo cluster, so
for the three time trials the maximum number of available slots
allocated for our test varied between 150 and 154. This difference did
not significantly affect the execution times for the January 2006
dataset.


Test type            Total run time (min)   Total run time (min)   Percentage       Percentage file
                     (1 month)              (12 months)            execution time   transfer time

AWS S3               7                      57                     55               45
AWS NFS              27                     -                      -                -
clarreo GPFS         5                      45                     -                -
clarreo Local Disk   6                      49                     73               27


Table 3. Total running times to process one month and one year of
PARASOL data for the four test configurations. Also shown are the
percentages (averaged over the number of SGE jobs) of the total running
time of each job spent on execution (CPU time + I/O) and on
transferring data for the AWS S3 and the benchmark clarreo Local Disk
tests.
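
To put the run times of Table 3 on a common scale, the following sketch
converts them into aggregate input throughput and into run-time ratios
relative to the clarreo GPFS test. It assumes decimal units (1 TB =
1000 GB) and uses the rounded input volumes from Table 1; the derived
numbers are illustrative estimates, not additional measurements.

```python
# Input volumes from Table 1 (GB) and total run times from Table 3 (min).
INPUT_GB = {"1 month": 224.0, "12 months": 2800.0}
RUN_MIN = {
    "AWS S3": {"1 month": 7.0, "12 months": 57.0},
    "clarreo GPFS": {"1 month": 5.0, "12 months": 45.0},
    "clarreo Local Disk": {"1 month": 6.0, "12 months": 49.0},
}


def throughput_gb_per_s(gigabytes: float, minutes: float) -> float:
    """Aggregate input volume processed per second of wall-clock time."""
    return gigabytes / (minutes * 60.0)


baseline = RUN_MIN["clarreo GPFS"]
for test, times in RUN_MIN.items():
    for period, minutes in times.items():
        rate = throughput_gb_per_s(INPUT_GB[period], minutes)
        ratio = minutes / baseline[period]
        print(f"{test:18s} {period:9s} {rate:.2f} GB/s  "
              f"{ratio:.2f}x GPFS time")
```

For the full-year run this gives roughly 0.8 GB/s on AWS S3 versus
about 1.0 GB/s on clarreo GPFS; that is, the AWS run took about 1.27
times as long.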


5 Conclusions 

We have shown using our test executable that the Amazon Cloud-based
cluster configured with S3 storage achieves processing times comparable
to those of the clarreo cluster, while the performance with NFS storage
is significantly inferior. Since one can easily configure a virtually
unlimited number of nodes, the advantage of AWS over the clarreo
cluster is its scalability, which may result in faster processing of
large datasets. Among the disadvantages of the Amazon Cloud is the
latency in opening remote GUI-based applications: we found that opening
remote windows, such as Emacs, can take up to 30 seconds. Another
drawback of using AWS is the need to transfer data into and out of the
cloud. In conclusion, the use of AWS as a CLARREO science computing
facility remains a possibility; however, other options, such as using a
NASA-based cluster, are also under consideration.